Remove unnecessary zeroing in Triton MXFP8 dequantize kernel by Micky774 · Pull Request #516 · ROCm/TransformerEngine

Micky774 · 2026-04-01T20:28:05Z

Description

These improvements were spotted by Claude while developing and practicing a GPU-profiling skill targeted at Triton MXFP8 casting. They speed up the Triton dequantization kernel by about 10-70%, with the biggest gains on the larger configs; quantization timings are essentially unchanged. With these changes, Triton dequantization is closer to HIP kernel performance for small matrices, and even outperforms HIP dequantization on the largest matrices.

Benchmarks were generated by bench_mxfp8_cast.py:

On dev:

MXFP8 triton cast benchmark  |  dtype=bf16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.065         0.042      48.2      74.8
       2048x2048       0.065         0.042     193.8     299.9
       4096x4096       0.063         0.041     797.7    1216.3
       4096x8192       0.062         0.041    1611.6    2430.9
       8192x8192       0.104         0.062    1938.0    3239.0
      4096x16384       0.111         0.063    1817.0    3186.9
      16384x4096       0.103         0.063    1951.9    3199.1

MXFP8 triton cast benchmark  |  dtype=fp16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.063         0.041      49.8      76.8
       2048x2048       0.063         0.041     199.5     307.6
       4096x4096       0.063         0.041     795.6    1218.8
       4096x8192       0.063         0.041    1603.0    2441.7
       8192x8192       0.106         0.061    1905.2    3277.1
      4096x16384       0.106         0.064    1893.4    3153.6
      16384x4096       0.102         0.065    1973.3    3109.6

MXFP8 triton cast benchmark  |  dtype=fp32  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.066         0.043      79.0     122.5
       2048x2048       0.065         0.043     323.1     487.9
       4096x4096       0.065         0.043    1294.1    1957.2
       4096x8192       0.066         0.048    2545.1    3474.8
       8192x8192       0.122         0.116    2747.6    2900.0
      4096x16384       0.138         0.118    2423.9    2849.9
      16384x4096       0.111         0.114    3012.5    2945.5

With PR:

MXFP8 triton cast benchmark  |  dtype=bf16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.066         0.037      47.8      84.1
       2048x2048       0.065         0.037     192.7     337.1
       4096x4096       0.065         0.037     771.3    1348.3
       4096x8192       0.065         0.038    1544.9    2670.2
       8192x8192       0.104         0.052    1929.3    3846.8
      4096x16384       0.113         0.054    1785.7    3731.9
      16384x4096       0.103         0.037    1953.4    5386.2

MXFP8 triton cast benchmark  |  dtype=fp16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.066         0.037      48.0      85.3
       2048x2048       0.065         0.037     193.9     338.4
       4096x4096       0.065         0.038     771.2    1339.4
       4096x8192       0.065         0.038    1537.5    2673.5
       8192x8192       0.105         0.052    1915.5    3849.2
      4096x16384       0.113         0.051    1780.9    3939.0
      16384x4096       0.103         0.038    1950.7    5291.4

MXFP8 triton cast benchmark  |  dtype=fp32  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.063         0.036      83.2     145.9
       2048x2048       0.062         0.036     337.9     581.6
       4096x4096       0.063         0.036    1341.5    2317.5
       4096x8192       0.062         0.036    2693.3    4639.3
       8192x8192       0.122         0.089    2746.8    3766.4
      4096x16384       0.138         0.097    2434.2    3455.5
      16384x4096       0.113         0.079    2980.1    4253.4

HIP kernel on dev:

MXFP8 triton cast benchmark  |  dtype=bf16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.018         0.013     174.4     234.9
       2048x2048       0.018         0.014     713.5     885.5
       4096x4096       0.023         0.020    2153.1    2579.0
       4096x8192       0.041         0.024    2441.5    4183.2
       8192x8192       0.073         0.039    2743.6    5182.0
      4096x16384       0.074         0.040    2721.1    5096.1
      16384x4096       0.076         0.044    2654.6    4547.4

MXFP8 triton cast benchmark  |  dtype=fp16  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.019         0.011     167.4     293.8
       2048x2048       0.018         0.011     703.9    1167.5
       4096x4096       0.023         0.016    2228.2    3191.3
       4096x8192       0.041         0.020    2478.1    5044.7
       8192x8192       0.073         0.038    2761.9    5339.4
      4096x16384       0.071         0.040    2828.4    5002.5
      16384x4096       0.074         0.041    2721.5    4856.7

MXFP8 triton cast benchmark  |  dtype=fp32  iters=500
           Shape  Quant (ms)  Dequant (ms)    Q GB/s   DQ GB/s
------------------------------------------------------------------
       1024x1024       0.018         0.011     283.6     496.7
       2048x2048       0.018         0.013    1173.7    1660.8
       4096x4096       0.031         0.022    2730.8    3740.5
       4096x8192       0.047         0.034    3534.8    4958.0
       8192x8192       0.098         0.096    3412.8    3495.9
      4096x16384       0.098         0.100    3429.4    3360.2
      16384x4096       0.099         0.103    3390.0    3247.3
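
The 10-70% figure and the HIP comparison can be re-derived from the tables above. The sketch below does so for the bf16 dequantize column only; the timings are copied from the three bf16 tables, while `speedup_percent` and `rows` are names introduced here purely for illustration and are not part of bench_mxfp8_cast.py.

```python
# bf16 "Dequant (ms)" columns copied from the three tables above.
shapes = ["1024x1024", "2048x2048", "4096x4096", "4096x8192",
          "8192x8192", "4096x16384", "16384x4096"]
dev_ms = [0.042, 0.042, 0.041, 0.041, 0.062, 0.063, 0.063]  # Triton on dev
pr_ms = [0.037, 0.037, 0.037, 0.038, 0.052, 0.054, 0.037]   # Triton with PR
hip_ms = [0.013, 0.014, 0.020, 0.024, 0.039, 0.040, 0.044]  # HIP on dev


def speedup_percent(before_ms, after_ms):
    """Throughput gain of `after` over `before`, as a percentage."""
    return (before_ms / after_ms - 1.0) * 100.0


# One row per shape: (shape, gain over dev in %, Triton-PR time / HIP time).
rows = [
    (shape, speedup_percent(dev, pr), pr / hip)
    for shape, dev, pr, hip in zip(shapes, dev_ms, pr_ms, hip_ms)
]

for shape, gain, ratio in rows:
    print(f"{shape:>12}  {gain:5.1f}% faster than dev  {ratio:4.2f}x HIP time")
```

For bf16 this gives gains from roughly 8% to roughly 70%, and a time ratio below 1.0 against HIP only for 16384x4096. The timings are printed to three decimals, so the derived percentages are approximate.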

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Replaces an unnecessary torch.zeros with torch.empty
Updates GROUP_Y size for dequantization
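
Swapping torch.zeros for torch.empty is safe only when the kernel writes every element of the output, so the initial fill is never read. The following is a minimal CPU sketch of that reasoning, not the PR's Triton kernel: NumPy's `np.zeros` and `np.empty` stand in for the torch calls, and `dequantize_into`, `BLOCK`, and the float32 stand-in for already-decoded FP8 values are hypothetical names and simplifications. It assumes the usual MXFP8 layout of one E8M0 scale byte (bias 127) per 32 consecutive elements.

```python
import numpy as np

BLOCK = 32  # assumed MXFP8 block size: one shared scale per 32 elements


def dequantize_into(out, values, scale_exponents):
    """Write values * 2**(exponent - 127) into every element of `out`."""
    rows, cols = values.shape
    # Decode the biased E8M0 exponent bytes into float32 multipliers.
    scales = np.exp2(scale_exponents.astype(np.float32) - 127.0)
    blocked_values = values.reshape(rows, cols // BLOCK, BLOCK)
    blocked_out = out.reshape(rows, cols // BLOCK, BLOCK)  # view onto `out`
    # Every output element is overwritten; nothing is accumulated into `out`.
    np.multiply(blocked_values, scales[:, :, None], out=blocked_out)
    return out


rng = np.random.default_rng(0)
values = rng.standard_normal((4, 64)).astype(np.float32)
scale_exponents = rng.integers(120, 135, size=(4, 2), dtype=np.uint8)

# Same result whether the buffer starts zero-filled or uninitialized.
from_zeros = dequantize_into(
    np.zeros((4, 64), dtype=np.float32), values, scale_exponents)
from_empty = dequantize_into(
    np.empty((4, 64), dtype=np.float32), values, scale_exponents)
print(np.array_equal(from_zeros, from_empty))
```

If a kernel skipped some elements (for example, masked-out padding that is later read), the uninitialized buffer would expose arbitrary values there, and the zero fill would still be needed.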

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Micky774 · 2026-04-03T17:56:57Z

Note that the CI level 3 tests had already passed before I force-pushed to reconcile with dev's altered history.

Micky774 requested review from ipanfilo, wangye805 and wenchenvincent as code owners April 1, 2026 20:28

Micky774 added the ci-level 3 label (CI test level 3) Apr 1, 2026

Micky774 changed the title ~~Remove unnecessary zeroing in Triton MXFP8 dequantize kernels~~ Remove unnecessary zeroing in Triton MXFP8 dequantize kernel Apr 1, 2026

wangye805 force-pushed the dev branch from 2f66594 to e15cc70 April 2, 2026 02:16

Minor improvements

f56564b

Micky774 force-pushed the zain/triton-mxfp8-empty branch from 6e9b670 to f56564b April 3, 2026 17:55

wangye805 approved these changes Apr 3, 2026

matthiasdiener approved these changes Apr 3, 2026

wangye805 merged commit 1c949a5 into dev Apr 4, 2026
3 checks passed

Micky774 deleted the zain/triton-mxfp8-empty branch April 6, 2026 16:34

wangye805 merged 1 commit into dev from zain/triton-mxfp8-empty
